What features are most important for used car auctions, and is the industry standard (MMR) accurate?¶

By Sky Reznik, John Lackey and Kevin Abatto

In [187]:
import pandas as pd  # type: ignore
import numpy as np  # type: ignore
import seaborn as sns  # type: ignore
import matplotlib.pyplot as plt  # type: ignore
from holoviews.operation import histogram
from sklearn.preprocessing import OneHotEncoder  # type: ignore
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error, mean_squared_error, r2_score

import os
In [189]:
# Load CSV into a pandas dataframe
shared_file_path = './kaggle_datasets/car_prices.csv'
# line 408,163 - "Model" field contains a comma (SE PZEV w/Connectivity, Navigation) - specify quotechar='"'
# This tells Pandas to treat anything inside double quotes as a single field, even if it contains commas.
#      solution provided by ChatGPT
df = pd.read_csv(shared_file_path, quotechar='"', on_bad_lines='skip', low_memory=False)

Now we need to append the MSRP of these cars. Using a dataset with the features make, model, year, transmission, etc., we can match on those values and add in the MSRP.

We have excluded trim and chosen the lowest-priced match to avoid oversaturating the data with multiple trim levels of the same vehicle at different prices.

In [191]:
# Load datasets
msrp = pd.read_csv('MSRP.csv')

# Standardize column names and values
msrp.rename(columns={
    'Make': 'make',
    "Model": "model",
    "Year": "year",
    "Transmission Type": "transmission"
}, inplace=True)

msrp['transmission'] = msrp['transmission'].str.lower()

# Create matching keys
df['match_key'] = (
    df['year'].astype(str).str.lower() + '_' + 
    df['make'].str.lower() + '_' + 
    df['model'].str.lower() + '_' + 
    df['transmission'].str.lower()
)

msrp['match_key'] = (
    msrp['year'].astype(str).str.lower() + '_' + 
    msrp['make'].str.lower() + '_' + 
    msrp['model'].str.lower() + '_' + 
    msrp['transmission'].str.lower()
)

# Sort by MSRP (ascending) and keep first (min) for each match_key
msrp_min_row = msrp.sort_values('MSRP').drop_duplicates('match_key', keep='first')

# Merge with original car data (keeping all car rows but adding MSRP where matched)
car_with_msrp = df.merge(
    msrp_min_row[['match_key', 'MSRP']],  # Only keep needed columns
    on='match_key',
    how='left'
)

# Drop the temporary key column
car_with_msrp = car_with_msrp.drop(columns=['match_key'])

# Save the merged DataFrame to a new CSV file
car_with_msrp.to_csv('car_prices_with_msrp.csv', index=False)

Data Cleaning¶

  1. Remove any row with missing data with df.dropna()
  2. One-hot encode ['transmission'] into ['auto_transmission'] 0/1 (double-check unique values)
  3. Simplify ['body'] (collapse all 85 body types to 9 types)
  4. Convert ['saledate'] to datetime
  5. Compute the age of the car (from ['saledate'])
  6. Drop negative-age cars
  7. Drop unnecessary or redundant columns
  8. Make 'condition' numeric
  9. Avoid outliers by keeping only values at or below the 95th percentile
In [193]:
# now we can use the car_prices_with_msrp DataFrame
df = pd.read_csv('car_prices_with_msrp.csv', low_memory=False)
# Remove 'Unnamed: 16' column
df.drop('Unnamed: 16', axis=1, inplace=True)
df['odometer'] = np.where(df['odometer'] == 999999.0, np.nan, df['odometer'])

1 - Remove rows that have missing values

In [196]:
df.dropna(inplace=True)

2 - One-hot encode the 'transmission' column

In [198]:
print(df['transmission'].unique())
df['auto_transmission'] = np.where(df['transmission'].str.contains('automatic', case=False), 1, 0)
['automatic' 'manual']

3 - Collapse the 'body' column (85 unique values --> 9 categories)

In [200]:
# ['suv' 'sedan' 'convertible' 'coupe' 'wagon' 'hatchback' 'truck' 'minivan' 'van']
print(df['body'].unique())
df['body_type'] = np.nan
df['body_type'] = np.where(df['body'].str.contains('minivan', case=False), 'minivan', df['body_type'])
df['body_type'] = np.where(df['body'].str.contains('sedan', case=False), 'sedan', df['body_type'])
df['body_type'] = np.where(df['body'].str.contains('wagon', case=False), 'wagon', df['body_type'])
df['body_type'] = np.where(df['body'].str.contains('coupe', case=False), 'coupe', df['body_type'])
df['body_type'] = np.where(df['body'].str.contains('koup', case=False), 'coupe', df['body_type'])
df['body_type'] = np.where(df['body'].str.contains('convertible', case=False), 'convertible', df['body_type'])
df['body_type'] = np.where(df['body'].str.contains('hatchback', case=False), 'hatchback', df['body_type'])
df['body_type'] = np.where(df['body'].str.contains(r'\bvan\b', case=False), 'van', df['body_type'])
df['body_type'] = np.where(df['body'].str.contains('truck', case=False), 'truck', df['body_type'])
df['body_type'] = np.where(df['body'].str.contains('cab', case=False), 'truck', df['body_type'])
df['body_type'] = np.where(df['body'].str.contains('crew', case=False), 'truck', df['body_type'])
df['body_type'] = np.where(df['body'].str.contains('suv', case=False), 'suv', df['body_type'])

# Numerical mapping for body types
size_mapping = {
    'convertible': 0,
    'coupe': 1,
    'hatchback': 2,
    'sedan': 3,
    'wagon': 4,
    'suv': 5,
    'minivan': 6,
    'truck': 7,
    'van': 8
}

df['body_size'] = df['body_type'].map(size_mapping)
['SUV' 'Sedan' 'Wagon' 'Hatchback' 'G Coupe' 'G Sedan' 'Elantra Coupe'
 'Coupe' 'CTS Coupe' 'E-Series Van' 'Van' 'Extended Cab' 'G Convertible'
 'Minivan' 'G37 Convertible' 'Convertible' 'Crew Cab' 'Quad Cab'
 'Regular Cab' 'Double Cab' 'Genesis Coupe' 'Q60 Convertible' 'SuperCab'
 'G37 Coupe' 'Cab Plus 4' 'Q60 Coupe' 'TSX Sport Wagon' 'CTS Wagon'
 'CTS-V Coupe' 'Beetle Convertible' 'CTS-V Wagon' 'Ram Van' 'sedan'
 'Cab Plus' 'van' 'King Cab' 'Access Cab' 'g sedan' 'g coupe' 'suv'
 'g convertible' 'e-series van' 'wagon' 'g37 convertible' 'minivan'
 'hatchback' 'convertible' 'crew cab' 'quad cab' 'cts coupe' 'coupe'
 'regular cab' 'elantra coupe' 'supercab' 'extended cab' 'q60 coupe'
 'king cab' 'double cab' 'access cab' 'genesis coupe' 'tsx sport wagon'
 'mega cab' 'q60 convertible' 'cab plus 4' 'SuperCrew']
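As a side note, the chain of np.where calls above can also be expressed as a single keyword-priority function. This is only a sketch of equivalent logic, with the keyword ordering mirroring the rules above (specific terms like 'minivan' checked before generic ones like 'van'):

```python
import pandas as pd

def collapse_body(body: str) -> str:
    """Map a raw body description to one of 9 coarse types (sketch of the rules above)."""
    b = str(body).lower()
    # ordered so more specific keywords win before generic ones
    rules = [
        ('minivan', 'minivan'), ('sedan', 'sedan'), ('wagon', 'wagon'),
        ('koup', 'coupe'), ('coupe', 'coupe'), ('convertible', 'convertible'),
        ('hatchback', 'hatchback'), ('van', 'van'),
        ('cab', 'truck'), ('crew', 'truck'), ('truck', 'truck'),
        ('suv', 'suv'),
    ]
    for key, label in rules:
        if key in b:
            return label
    return b  # leave unmatched values as-is

sample = pd.Series(['G Coupe', 'E-Series Van', 'Crew Cab', 'Minivan', 'SUV'])
print(sample.map(collapse_body).tolist())  # ['coupe', 'van', 'truck', 'minivan', 'suv']
```

A single ordered rule list keeps the precedence explicit and is easier to extend than a dozen separate np.where lines.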

4 - Converting 'saledate' to datetime (solution provided by ChatGPT)

In [202]:
# Handle invalid or unexpected values in the 'saledate' column
# Extract just the date part (e.g., "Dec 16 2014") before conversion
df['saledate'] = pd.to_datetime(
    df['saledate'].str.extract(r'(\w{3} \d{2} \d{4})')[0], 
    format='%b %d %Y', 
    errors='coerce'
)

5 - Create a new column 'car_age'

In [204]:
df['car_age'] = np.where(df['saledate'].notna(), df['saledate'].dt.year - df['year'], np.nan)

6 - Some car ages are negative (which should be impossible) because, for example, a 2015 model-year car can exist, and be sold, in 2014

In [207]:
# For rows with values less than 0, drop rows
print("Number of negative car ages dropped: ", df[df['car_age'] < 0].shape[0])
df.drop(df[df['car_age'] < 0].index, inplace=True)
# Drop old columns 'transmission' & 'body'
df.drop(['transmission', 'body'], axis=1, inplace=True)
Number of negative car ages dropped:  135

7 - Drop unnecessary or redundant columns

In [209]:
df.drop(['year', 'trim', 'vin', 'color', 'interior', 'saledate', 'body_type'], axis=1, inplace=True)

8 - Make 'condition' numeric

In [215]:
df['condition'] = pd.to_numeric(df['condition'], errors='coerce')

9 - Keep only values at or below the 95th percentile of 'sellingprice' to avoid outliers

In [222]:
sns.histplot(data=df, x='sellingprice')
threshold = df['sellingprice'].quantile(0.95)
df = df[df['sellingprice'] <= threshold]
sns.histplot(data=df, x='sellingprice')
plt.show()

10 - Creating 'sold_above_mmr'

MMR is provided in the dataset. It is the "Manheim Market Report", an estimate of a car's selling value that is updated nightly and trained on millions of auction transactions. For our auction data, we can assume it is a seller's goal to surpass the MMR at auction.

Therefore, a relevant binary indicator is whether sellingprice > mmr. While it is good practice to avoid redundant columns (ones that can be inferred from the data), a binary column sold_above_mmr will be highly useful for geographical visualizations and for determining which sellers or states outperform their estimated MMR.

We will create this variable below:

In [224]:
# ensure mmr and sellingprice are numeric
df['mmr'] = pd.to_numeric(df['mmr'], errors='coerce')
df['sellingprice'] = pd.to_numeric(df['sellingprice'], errors='coerce')
df['sold_above_mmr'] = np.where(df['sellingprice'] > df['mmr'], 1, 0 )
df[['sellingprice', 'mmr', 'sold_above_mmr']].sample(6)
Out[224]:
sellingprice mmr sold_above_mmr
353285 1300 3750 0
543126 8700 10050 0
275107 6400 5050 1
222296 20000 20100 0
483417 17600 18600 0
397561 7300 6975 1
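As a quick usage sketch of this new column (toy rows below, not the real dataset), the per-state share of sales beating MMR is a one-line groupby:

```python
import pandas as pd
import numpy as np

# toy rows standing in for the cleaned dataset
toy = pd.DataFrame({
    'state': ['ca', 'ca', 'tx', 'tx', 'fl'],
    'sellingprice': [13700, 1300, 6400, 20000, 7300],
    'mmr': [12900, 3750, 5050, 20100, 6975],
})
toy['sold_above_mmr'] = np.where(toy['sellingprice'] > toy['mmr'], 1, 0)

# share of sales that beat the MMR estimate, per state
above_by_state = toy.groupby('state')['sold_above_mmr'].mean().sort_values(ascending=False)
print(above_by_state)
```

Because the column is 0/1, the groupby mean is exactly the fraction of sales above MMR in each group; the same pattern works with 'seller' in place of 'state'.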
In [227]:
# Save clean data to CSV
df.to_csv('cleaned_car_prices.csv', index=False)
In [29]:
# Re-load the cleaned data
df = pd.read_csv('cleaned_car_prices.csv', low_memory=False)

Exploratory Data Analysis¶

First, let's learn more about our data now that it has been cleaned!

In [32]:
print(df.columns.unique())
Index(['make', 'model', 'state', 'condition', 'odometer', 'seller', 'mmr',
       'sellingprice', 'MSRP', 'auto_transmission', 'body_size', 'car_age',
       'sold_above_mmr'],
      dtype='object')

DataFrame Features Overview¶

1. make¶
  • Description: The manufacturer/brand of the vehicle
  • Type: Categorical (e.g., Toyota, Ford, Honda)
  • Potential Use: Grouping vehicles by brand for analysis
2. model¶
  • Description: The specific model name of the vehicle
  • Type: Categorical (e.g., Camry, F-150, Escape)
  • Potential Use: Detailed vehicle identification when combined with make
3. state¶
  • Description: The geographical state where the vehicle is located/registered
  • Type: Categorical (e.g., CA, TX, NY)
  • Potential Use: Regional price analysis or demand patterns
4. condition¶
  • Description: The physical/mechanical condition of the vehicle
  • Type: Numerical (0-5)
  • Potential Use: Key factor in pricing models
5. odometer¶
  • Description: The mileage reading showing how many miles the vehicle has traveled
  • Type: Numerical (continuous)
  • Potential Use: Strong predictor of vehicle value and wear
6. seller¶
  • Description: The party selling the vehicle
  • Type: Categorical
  • Potential Use: Analyzing price differences between seller types
7. mmr (Manheim Market Report)¶
  • Description: Wholesale market valuation of the vehicle, calculated and updated every 24 hours
  • Type: Numerical (currency USD)
  • Potential Use: Benchmark for comparing selling prices
8. sellingprice¶
  • Description: The actual auction sale price of the vehicle
  • Type: Numerical (currency USD)
  • Potential Use: Target variable for price prediction models
9. MSRP (Manufacturer's Suggested Retail Price)¶
  • Description: The original new vehicle price recommended by manufacturer
  • Type: Numerical (currency USD)
  • Potential Use: Baseline for depreciation calculations
10. auto_transmission¶
  • Description: Whether the vehicle has automatic transmission
  • Type: Boolean (0/1)
  • Potential Use: Analyzing price differences between transmission types
11. body_size¶
  • Description: The size classification of the vehicle
  • Type: Ordinal numerical (0 = smallest body style to 8 = largest)
  • Potential Use: Market segment analysis
12. car_age¶
  • Description: The age of the vehicle in years
  • Type: Numerical (discrete)
  • Potential Use: Key factor in depreciation models
13. sold_above_mmr¶
  • Description: Whether the vehicle sold above its MMR estimate
  • Type: Boolean (1/0)
  • Potential Use: Determining shortcomings of MMR model

Here is a look at the format of the dataframe and what a row looks like:

In [35]:
print(df.sample(1))
      make model state  condition  odometer                seller    mmr  \
32537  Kia  Soul    ca        4.3   37538.0  avis rac/san leandro  12900   

       sellingprice     MSRP  auto_transmission  body_size  car_age  \
32537         13700  16900.0                  1          4      1.0   

       sold_above_mmr  
32537               1  

Boxplots¶

First, let's look at some boxplots of our numerical features to understand the spread of our data.

In [37]:
sns.set(style="whitegrid")

# Select numerical columns
numerical_cols = ['odometer', 'condition', 'sellingprice', 'MSRP', 'car_age', 'body_size']

# Create subplots
plt.figure(figsize=(12, 8))
for i, col in enumerate(numerical_cols, 1):
    plt.subplot(3, 2, i)  
    sns.boxplot(y=df[col], color='skyblue')
    plt.title(f'Boxplot of {col}')
    plt.ylabel('')

plt.tight_layout()  # Prevent overlapping
plt.show()
In [38]:
plt.figure(figsize = (12, 12))
sns.heatmap(
    df[numerical_cols].corr(),
    annot = True,
    fmt = '.2f',
    cmap = 'coolwarm_r',
    vmin = -1, 
    vmax = 1
);

For fun, let's do some visualizations

In [40]:
import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt
#https://geopandas.org/en/stable/docs/user_guide/mapping.html

# Verify the installation by printing the version
print(f"GeoPandas version: {gpd.__version__}")

gdf = gpd.read_file('tl_2024_us_state/tl_2024_us_state.shp') #read file from US census
gdf = gdf.to_crs("EPSG:4326")

df2 = pd.read_csv('cleaned_car_prices.csv')

sales_counts = df2['state'].value_counts().reset_index()
sales_counts.rename(columns={'state':'STUSPS'}, inplace=True) # rename for merge
sales_counts['STUSPS'] = sales_counts['STUSPS'].str.upper() # uppercase for merge

merged_df = gdf.merge(sales_counts, how='left', on='STUSPS')

non_continental = ['HI','VI','MP','GU','AK','AS','PR']
us49 = merged_df
for n in non_continental:
    us49 = us49[us49.STUSPS != n]

# format and convert NaN's to zeroes
us49.rename(columns={'count':'NUM_SALES'}, inplace=True)
us49.fillna(0, inplace=True)

from mpl_toolkits.axes_grid1 import make_axes_locatable
from matplotlib.pyplot import colormaps

# Source for function https://medium.com/@jl_ruiz/plot-maps-from-the-us-census-bureau-using-geopandas-and-contextily-in-python-df787647ef77
def StatesPlot(df,data, cmap):
    f,ax = plt.subplots(1,1, figsize=(15,10),
    sharex=True, sharey=True, dpi=300)
    f.tight_layout()
    plt.title('Car sales per state')
    ax.set_axis_off()
    divider = make_axes_locatable(ax)
    cax = divider.append_axes("right", size="4%",
    pad=0.1,alpha=0.01)
    df.plot(data, ax=ax, alpha=0.5, cmap=cmap,
    edgecolor='none', legend=True, cax=cax, linewidth=0.3)
    plt.ylabel('Sales', fontsize=12)
    plt.show()

StatesPlot(us49, 'NUM_SALES', cmap='Wistia')
GeoPandas version: 1.0.1

We can see that a majority of sales are in California, Texas, and Florida. While it isn't relevant to our other questions, this is likely driven by limitations of the dataset and simply by state populations.
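One way to test the population guess is to normalize sales per capita. A minimal sketch with illustrative sale counts and approximate state populations (both are assumptions for this sketch, not values from the dataset):

```python
import pandas as pd

# illustrative sale counts and approximate state populations (assumptions, not dataset values)
counts = pd.Series({'ca': 12000, 'tx': 7000, 'fl': 8000, 'oh': 2500})
pop = pd.Series({'ca': 39_000_000, 'tx': 30_000_000, 'fl': 22_000_000, 'oh': 11_800_000})

# sales per 100k residents; large states may no longer dominate once normalized
sales_per_100k = (counts / pop * 100_000).round(1).sort_values(ascending=False)
print(sales_per_100k)
```

If the per-capita rates come out roughly level, raw counts are mostly a population effect; a large spread would instead suggest dataset coverage differences.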

In [42]:
import plotly.express as px
# Sample data
treemap_df = df2['make'].value_counts().reset_index()
treemap_df.columns = ['make', 'count']

make_to_country = {
    "Infiniti": "Japan",
    "Dodge": "USA",
    "Chevrolet": "USA",
    "Chrysler": "USA",
    "Kia": "South Korea",
    "Ford": "USA",
    "Lexus": "Japan",
    "Pontiac": "USA",
    "Nissan": "Japan",
    "Hyundai": "South Korea",
    "Acura": "Japan",
    "GMC": "USA",
    "Mazda": "Japan",
    "Buick": "USA",
    "Toyota": "Japan",
    "Lincoln": "USA",
    "Volkswagen": "Germany",
    "Cadillac": "USA",
    "Honda": "Japan",
    "Suzuki": "Japan",
    "Subaru": "Japan",
    "Volvo": "Sweden",
    "Mitsubishi": "Japan",
    "Mercedes-Benz": "Germany",
    "Scion": "Japan",
    "Oldsmobile": "USA",
    "BMW": "Germany",
    "HUMMER": "USA",
    "Land Rover": "UK",
    "Saab": "Sweden",
    "Audi": "Germany",
    "Porsche": "Germany",
    "Plymouth": "USA",
    "FIAT": "Italy",
    "Maserati": "Italy"
}

treemap_df['country'] = treemap_df['make'].map(make_to_country)

fig = px.treemap(
    treemap_df,
    path=['country','make'],
    values='count',
    color='country',
color_discrete_sequence=px.colors.qualitative.Safe)

fig.update_layout(margin=dict(t=30, l=10, r=10, b=10))
fig.show()

Data Predictions¶

'mmr' v. 'sellingprice'¶

In [44]:
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="whitegrid")
plt.scatter(df['sellingprice'], df['mmr'], alpha=0.2, s=5)  
sns.regplot(x='sellingprice', y='mmr', data=df, scatter=False, color='red')  
plt.title('Selling Price vs MMR')  
plt.xlabel('Selling Price')  
plt.ylabel('MMR') 
plt.show()
In [45]:
X = df['mmr']  # Predictor 
y = df['sellingprice']  # Target 
r_squared = r2_score(y, X)  # treat MMR itself as the prediction
print(f"R-squared (R²) between MMR and Selling Price: {r_squared}")

rmse = root_mean_squared_error(df['sellingprice'], df['mmr'])
# Create a DataFrame to store R² values
results = pd.DataFrame(columns=['Predictor', 'R_Squared', 'RMSE'])
results.loc[0] = ['MMR', r_squared, rmse]
results
R-squared (R²) between MMR and Selling Price: 0.9538178923973161
Out[45]:
Predictor R_Squared RMSE
0 MMR 0.953818 1712.649879

Here we get a baseline for how accurate MMR is. With an R² of 0.9538, it explains about 95% of the variance in 'sellingprice'. We will see if we can build a model better than the industry standard!

Using features to predict selling price with Linear Regression¶

List of numeric features:

  1. condition
  2. odometer
  3. auto_transmission
  4. car_age
  5. MSRP
  6. body_size
In [48]:
df_predictor = df[['condition', 'odometer', 'auto_transmission', 'car_age', 'MSRP', 'body_size']]
df_target = df['sellingprice']
Xlr = df_predictor
ylr = df_target

# Split into training and testing set
Xlr_train, Xlr_test, Ylr_train, Ylr_test = train_test_split(Xlr, ylr, test_size=0.3, random_state=39)
# Train the Linear Regression model
model = LinearRegression()
model.fit(Xlr_train, Ylr_train)
Out[48]:
LinearRegression()

Make predictions and evaluate the model

In [50]:
ylr_pred = model.predict(Xlr_test)
ylr_rmse = root_mean_squared_error(Ylr_test, ylr_pred)
print(f"Root Mean Squared Error (RMSE): {ylr_rmse}")

r2_linreg = r2_score(Ylr_test, ylr_pred)
print(f"R^2 (coefficient of determination): {r2_linreg}")
Root Mean Squared Error (RMSE): 3465.8965609620914
R^2 (coefficient of determination): 0.8128656644603673

So, unsurprisingly, our linear regression model did not perform as well as MMR did.

In [52]:
results.loc[1] = ['Linear Regression', r2_linreg, ylr_rmse]
results.sort_values(by='R_Squared', ascending=False, inplace=True)
print(results)
           Predictor  R_Squared         RMSE
0                MMR   0.953818  1712.649879
1  Linear Regression   0.812866  3465.896561

Plotting predictions from the linear regression model vs. actual selling price

In [54]:
plt.figure(figsize=(10,6))
plt.scatter(Ylr_test, ylr_pred, alpha=0.5, color='blue') 
plt.plot([min(Ylr_test), max(Ylr_test)], [min(Ylr_test), max(Ylr_test)], color='red', linewidth=1.5) # Perfect prediction line
plt.xlabel('Actual Selling Price') 
plt.ylabel('Linear Regression Predicted Selling Price')
plt.title('Actual vs Linear Regression Predicted Selling Price')
plt.show()

Standardization and PCA¶

To visualize using PCA, we must standardize the features using StandardScaler() from sklearn.preprocessing

In [56]:
numerical_cols = ['condition', 'odometer', 'auto_transmission', 'car_age', 'MSRP', 'body_size']

from sklearn.preprocessing import StandardScaler
# Starting the engine
scaler = StandardScaler()

df_stand = pd.DataFrame(
    scaler.fit_transform(df[numerical_cols]),
    # Keeping the original index and column names
    index = df.index,
    columns = df[numerical_cols].columns
)

df_stand
Out[56]:
condition odometer auto_transmission car_age MSRP body_size
0 0.582548 -0.826618 0.177124 -1.059310 0.657560 -0.488932
1 -1.582630 -1.005590 0.177124 -1.059310 -0.602447 -0.488932
2 -1.685734 -0.821365 0.177124 -1.059310 -0.070444 -0.488932
3 0.685652 -0.780987 0.177124 -1.059310 0.902562 0.086034
4 1.304274 -0.858563 0.177124 -1.059310 0.902562 0.086034
... ... ... ... ... ... ...
51641 0.685652 -1.036201 0.177124 -0.789743 -0.719115 -0.488932
51642 1.304274 -1.011888 0.177124 -0.789743 -0.530113 0.661001
51643 0.170133 -0.949208 -5.645760 -0.789743 -0.299695 -1.063898
51644 -0.860904 -0.032231 0.177124 1.097225 -0.540030 -0.488932
51645 0.273237 -0.777869 0.177124 -1.059310 -0.625781 -0.488932

51646 rows × 6 columns

In [57]:
from sklearn.decomposition import PCA
df_pca = PCA(n_components = 2).fit(df_stand)

print(f'The variance retained by the first PC is: {round(df_pca.explained_variance_ratio_[0]*100, 2)}%')
print(f'The variance retained by the second PC is: {round(df_pca.explained_variance_ratio_[1]*100, 2)}%')
print(f'for a total variance retained of: {round(sum(df_pca.explained_variance_ratio_)*100, 2)}%')
The variance retained by the first PC is: 39.06%
The variance retained by the second PC is: 18.8%
for a total variance retained of: 57.87%
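A common way to decide how many components to keep is the cumulative explained-variance curve. A minimal sketch on random stand-in data (not df_stand), keeping the smallest number of PCs that explains at least 90% of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

# random standardized data stands in for df_stand here
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 6))
X[:, 1] = X[:, 0] + 0.1 * rng.standard_normal(500)  # add correlation so PC1 dominates

# cumulative explained variance over all components
cum = np.cumsum(PCA().fit(X).explained_variance_ratio_)
n_keep = int(np.searchsorted(cum, 0.90) + 1)  # smallest k explaining >= 90%
print(cum.round(3), n_keep)
```

With only two components our notebook retains about 58% of the variance, so a curve like this shows what a third or fourth component would buy.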
In [58]:
# Source: 19_PCA.ipynb

# Get PCA scores (coordinates of observations in PC space)
scores = df_pca.transform(df_stand)

# Get PCA loadings (contributions of original features to PCs)
loadings = df_pca.components_.T * np.sqrt(df_pca.explained_variance_)

# Create figure
plt.figure(figsize=(10, 8))

scatter = plt.scatter(
    x=scores[:, 0],  # PC1 scores
    y=scores[:, 1],  # PC2 scores
    alpha=0.5,
    s=30
)

for i, feature in enumerate(numerical_cols):
    plt.arrow(
        x=0, y=0,
        dx=loadings[i, 0]*3,  # Scaling factor for visibility
        dy=loadings[i, 1]*3,
        color='red',
        alpha=0.8,
        head_width=0.1
    )
    plt.text(       # https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.text.html
        loadings[i, 0]*3.2,  
        loadings[i, 1]*3.2,
        feature,
        horizontalalignment='center',  
        color='red',
        fontsize=12
    )

plt.title('PCA Biplot', fontsize=14)
plt.xlabel(f'PC1 ({round(df_pca.explained_variance_ratio_[0]*100, 1)}% Variance)')
plt.ylabel(f'PC2 ({round(df_pca.explained_variance_ratio_[1]*100, 1)}% Variance)')
plt.grid(alpha=0.3)
plt.axhline(0, color='grey', ls='--')
plt.axvline(0, color='grey', ls='--')

# Add variance explanation in legend - https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.legend.html
plt.legend(
    handles=[plt.Line2D([0], [0], marker='o', color='w', label=f'Total Variance Explained: {round(sum(df_pca.explained_variance_ratio_)*100, 1)}%',
             markerfacecolor='blue', markersize=10)],
    loc='best'
)

plt.tight_layout()
plt.show()

Now let's see how the first two PCs perform when predicting 'sellingprice'

In [60]:
X_pca = df_pca.transform(df_stand)

y_pca = df['sellingprice']
X_pca = X_pca[:, [0, 1]]

X_pca_train, X_pca_test, Y_pca_train, Y_pca_test = train_test_split(X_pca, y_pca, test_size=0.3, random_state=39)

model = LinearRegression()

model.fit(X_pca[:, [0, 1]], y_pca)
Out[60]:
LinearRegression()
In [61]:
y_pca_pred = model.predict(X_pca_test)
mse = root_mean_squared_error(Y_pca_test, y_pca_pred)
print(f"Root Mean Squared Error (RMSE): {mse}")

r2_pca2 = r2_score(Y_pca_test, y_pca_pred)
print(f"R^2 (coefficient of determination): {r2_pca2}")
Root Mean Squared Error (RMSE): 4510.473314190347
R^2 (coefficient of determination): 0.6830676866848165
In [62]:
plt.figure(figsize=(10,6))
plt.scatter(Y_pca_test, y_pca_pred, alpha=0.5, color='red') 
plt.plot([min(Y_pca_test), max(Y_pca_test)], [min(Y_pca_test), max(Y_pca_test)], color='blue', linewidth=1.5) # Perfect prediction line
plt.xlabel('Actual Selling Price') 
plt.ylabel('Two Principal Component Predicted Selling Price')
plt.title('Actual vs Two Principal Component Predicted Selling Price')
plt.show()
In [63]:
# Predict on the full PCA-transformed data and calculate R²
y_pred = model.predict(X_pca[:, [0, 1]])
r_squared = r2_score(y_pca, y_pred)

# Test-set RMSE from the predictions above
rmse = root_mean_squared_error(Y_pca_test, y_pca_pred)

results.loc[2] = ['PCA (2 principal components)', r_squared, rmse]
results.sort_values(by='R_Squared', ascending=False, inplace=True)
print(results)
                      Predictor  R_Squared         RMSE
0                           MMR   0.953818  1712.649879
1             Linear Regression   0.812866  3465.896561
2  PCA (2 principal components)   0.683350  4510.473314

Unsurprisingly, it performs similarly to the linear regression with all the features, but it is impressive how little accuracy was lost in the 2-D reduction!

XGBoost¶

Let's see if XGBoost can do any better, first without tuning parameters, to find a baseline

In [66]:
from xgboost import XGBRegressor

X_train, X_test, y_train, y_test = train_test_split(df_predictor, df_target, test_size=0.3, random_state=39) # Create training set

# Materials sourced from lecture slides
In [67]:
xgb_total = (
    XGBRegressor(
        max_depth = 5, 
        n_estimators = 200,
    )
    .fit(X_train, y_train)
)

# Calculating the R-squared using the test data
xgb_R2 = xgb_total.score(X_test, y_test)

y_pred = xgb_total.predict(X_test)
rmse = root_mean_squared_error(y_test, y_pred)
print(f'R-squared value before tuning: {xgb_R2}')
results.loc[3] = ['XGBoost (pretuned)', xgb_R2, rmse]

# Sort results by R-squared values
results.sort_values(by='R_Squared', ascending=False, inplace=True)
print(results)
R-squared value before tuning: 0.9538455605506897
                      Predictor  R_Squared         RMSE
3            XGBoost (pretuned)   0.953846  1721.257059
0                           MMR   0.953818  1712.649879
1             Linear Regression   0.812866  3465.896561
2  PCA (2 principal components)   0.683350  4510.473314

As we know, XGBoost has many hyperparameters. Let's do a grid search to find the best combination

In [69]:
xgb_hp_dict = {
    'learning_rate':    [0.01, 0.05, 0.1, 0.2, 0.3], # Each new prediction's contribution to y_hat
    'n_estimators':     [100, 250, 500, 750, 1000],  # Number of trees to make
    'max_depth':        [1, 3, 5, 7],                # Maximum number of splits: 1, 8, 32, 128 leaf nodes   
    'colsample_bytree': [0.4, 0.6, 0.8, 1],          # Number of features to use per tree (40, 60, 80, & 100%)
    'lambda':           [0, 0.5, 1, 2, 5]            # Regularization parameters to try
}
In [110]:
from sklearn.model_selection import GridSearchCV, KFold

# 5 fold cross-val
cv5 = KFold(n_splits = 5, shuffle = True)

# Defining the model:
xgb_reg = XGBRegressor(random_state = 3870, early_stopping_rounds = 50)

# Defining the grid search
grid_search = GridSearchCV(
    estimator = xgb_reg,
    param_grid = xgb_hp_dict,
    n_jobs = -1,
    cv = cv5    
)
In [112]:
import time

curr_time = time.time()
gs_results = (
    grid_search
    .fit(
        X_train, y_train,
        eval_set = [(X_test, y_test)],
        verbose = False
    )
)
end_time = time.time()

print(f'Wow! That took {end_time - curr_time} seconds')
/opt/anaconda3/lib/python3.12/site-packages/joblib/externals/loky/process_executor.py:752: UserWarning:

A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.

Wow! That took 539.7350447177887 seconds
In [119]:
## run to skip grid search
best_hps = {'colsample_bytree': 0.8,
 'lambda': 5,
 'learning_rate': 0.05,
 'max_depth': 7,
 'n_estimators': 1000}
In [121]:
# best_hps = grid_search.best_params_
print(f'After many iterations... the best parameters are')
best_hps
After many iterations... the best parameters are
Out[121]:
{'colsample_bytree': 0.8,
 'lambda': 5,
 'learning_rate': 0.05,
 'max_depth': 7,
 'n_estimators': 1000}

Now that the hard work is done, let's use these hyperparameters

In [124]:
best_xgb = (
    XGBRegressor(
        **best_hps,
    )
    .fit(
        X_train, y_train,
        eval_set = [(X_test, y_test)],
        verbose = False
    )
)

y_pred = best_xgb.predict(X_test)
rmse = root_mean_squared_error(y_test, y_pred)

xgb_tuned_r2 = best_xgb.score(X_test, y_test)
results.loc[4] = ['XGB (tuned)', xgb_tuned_r2, rmse]
print(results.sort_values(by = 'R_Squared', ascending = False))
                      Predictor  R_Squared         RMSE
4                   XGB (tuned)   0.956010  1680.416265
3            XGBoost (pretuned)   0.953846  1721.257059
0                           MMR   0.953818  1712.649879
1             Linear Regression   0.812866  3465.896561
2  PCA (2 principal components)   0.683350  4510.473314

Wow! After tuning, the R² is higher AND the RMSE is lower than MMR's

In [132]:
import xgboost
xgboost.plot_importance(best_xgb)
plt.show()

Random Forest¶

First let's import the proper packages

In [135]:
from sklearn.ensemble import RandomForestRegressor

Next we need to select the same predictor features as before and train-test split the data

In [138]:
df_predictor = df[['condition', 'odometer', 'auto_transmission', 'car_age', 'MSRP', 'body_size']]
df_target = df['sellingprice']

X_rf_train, X_rf_test, Y_rf_train, Y_rf_test = train_test_split(df_predictor, df_target, test_size=0.3, random_state=39)

Now we can start the Random Forest

In [141]:
rf_model = RandomForestRegressor(n_estimators=100, random_state=31)
rf_model.fit(X_rf_train, Y_rf_train)
Out[141]:
RandomForestRegressor(random_state=31)

From here, we can make predictions and look at the model's accuracy.

In [143]:
Y_rf_pred = rf_model.predict(X_rf_test)

mse_rf = mean_squared_error(Y_rf_test, Y_rf_pred)
r2_rf = r2_score(Y_rf_test, Y_rf_pred)

print(f"Random Forest MSE: {mse_rf}")
print(f"Random Forest R^2: {r2_rf}")
Random Forest MSE: 3452615.4366686707
Random Forest R^2: 0.9462138457344813

The Random Forest Model gives us a very strong R^2 value. Let's visualize the actual vs predicted values next.

In [147]:
plt.figure(figsize=(10,6))
plt.scatter(Y_rf_test, Y_rf_pred, alpha=0.5, color='blue')
plt.plot([min(Y_rf_test), max(Y_rf_test)], [min(Y_rf_test), max(Y_rf_test)], color='red', linewidth=1.0) 
plt.xlabel('Actual Selling Price')
plt.ylabel('Random Forest Predicted Selling Price')
plt.title('Actual vs Random Forest Predicted Selling Price')
plt.show()

Feature Importance¶

We can also investigate what features were the most important in the splits of the trees.

In [150]:
feat_imp = pd.DataFrame({
    'feature': df_predictor.columns,
    'importance': rf_model.feature_importances_
}).sort_values(by='importance', ascending=False)

feat_imp
Out[150]:
feature importance
3 car_age 0.559273
4 MSRP 0.273112
1 odometer 0.087265
0 condition 0.043139
5 body_size 0.034839
2 auto_transmission 0.002371
In [152]:
feat_imp.set_index('feature').plot(kind = 'bar');

This means that the age of the car was the most important feature in the splits of the trees
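Impurity-based importances like these can overweight correlated features (car_age and odometer are strongly correlated). Permutation importance on held-out data is a common cross-check: it measures how much the test score drops when one feature is shuffled. A minimal sketch on synthetic data (not the auction set), using sklearn.inspection.permutation_importance:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 1000
age = rng.uniform(0, 15, n)
odo = age * 12_000 + rng.normal(0, 5_000, n)      # mileage correlates with age
noise = rng.standard_normal(n)                    # weak, nearly irrelevant feature
price = 30_000 - 1_500 * age - 0.05 * odo + 500 * noise

X = np.column_stack([age, odo, noise])
X_tr, X_te, y_tr, y_te = train_test_split(X, price, random_state=0)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# score drop on held-out data when each column is shuffled
perm = permutation_importance(rf, X_te, y_te, n_repeats=5, random_state=0)
for name, imp in zip(['age', 'odometer', 'noise'], perm.importances_mean):
    print(f'{name}: {imp:.3f}')
```

Applied to our fitted rf_model with (X_rf_test, Y_rf_test), the same call would show whether car_age keeps its lead once correlation with odometer is accounted for.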

Tuning¶

In [156]:
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'max_features': [1.0, 'sqrt']  # 1.0 = all features ('auto' was removed in newer scikit-learn)
}
In [ ]:
rf = RandomForestRegressor(random_state=31)

grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                           cv=3, n_jobs=-1, verbose=1, scoring='r2')

grid_search.fit(X_rf_train, Y_rf_train)

Now we extract the best estimator found by the grid search and evaluate it on the test set.

In [160]:
best_rf = grid_search.best_estimator_

Y_rf_pred = best_rf.predict(X_rf_test)

mse_rf = mean_squared_error(Y_rf_test, Y_rf_pred)
r2_rf = r2_score(Y_rf_test, Y_rf_pred)

print(f"Tuned Random Forest MSE: {mse_rf}")
print(f"Tuned Random Forest R^2: {r2_rf}")
Tuned Random Forest MSE: 3232353.2124780184
Tuned Random Forest R^2: 0.9496451742987235

We can see a slight improvement in the tuned model over the untuned model. Let's look at the feature importances again to see whether they changed.
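The improvement can be quantified directly from the printed metrics: tuning cut MSE by roughly 6% and raised R^2 by about 0.003 (values copied from the outputs above):

```python
mse_base, mse_tuned = 3452615.4366686707, 3232353.2124780184
r2_base, r2_tuned = 0.9462138457344813, 0.9496451742987235

mse_drop = (mse_base - mse_tuned) / mse_base
print(f"MSE reduced by {mse_drop:.1%}, R^2 up by {r2_tuned - r2_base:.4f}")
# prints: MSE reduced by 6.4%, R^2 up by 0.0034
```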

In [163]:
feat_imp2 = pd.DataFrame({
    'feature': df_predictor.columns,
    'importance': best_rf.feature_importances_
}).sort_values(by='importance', ascending=False)

feat_imp2.set_index('feature').plot(kind='bar', title='Feature Importance', figsize=(8,5))
plt.ylabel('Importance')
plt.show()

feat_imp2
[Figure: bar chart of tuned Random Forest feature importances]
Out[163]:
feature importance
3 car_age 0.283894
1 odometer 0.277212
4 MSRP 0.270950
0 condition 0.120909
5 body_size 0.044340
2 auto_transmission 0.002695

We can see that the tuned model spreads importance much more evenly: odometer and condition gain substantially, car age drops sharply, and MSRP stays roughly the same.
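A side-by-side difference makes the shift easier to see. A sketch rebuilding the two importance vectors from the printed tables above, so the cell is self-contained:

```python
import pandas as pd

before = pd.Series({'car_age': 0.559273, 'MSRP': 0.273112, 'odometer': 0.087265,
                    'condition': 0.043139, 'body_size': 0.034839, 'auto_transmission': 0.002371})
after = pd.Series({'car_age': 0.283894, 'odometer': 0.277212, 'MSRP': 0.270950,
                   'condition': 0.120909, 'body_size': 0.044340, 'auto_transmission': 0.002695})
shift = (after - before).sort_values()
print(shift)  # car_age drops sharply; odometer and condition rise
```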

Model Results¶
In [177]:
results.loc[5] = ['Random Forest Regression', r2_rf, np.sqrt(mse_rf)]  # take the square root so the RMSE column is actually an RMSE, not an MSE
results.sort_values(by='R_Squared', ascending=False, inplace=True)
print(results)
                      Predictor  R_Squared          RMSE
4                   XGB (tuned)   0.956010  1.680416e+03
3            XGBoost (pretuned)   0.953846  1.721257e+03
0                           MMR   0.953818  1.712650e+03
5      Random Forest Regression   0.949645  1.797875e+03
1             Linear Regression   0.812866  3.465897e+03
2  PCA (2 principal components)   0.683350  4.510473e+03

Summary and Conclusion¶

To compare the models, let's plot each model's predictions against the actual selling prices. We will not be able to plot PCA.

In [180]:
model = LinearRegression() # retrain LR
model.fit(Xlr_train, Ylr_train)

y_pred_xgb = best_xgb.predict(X_test)
y_pred_rf = rf_model.predict(X_test)
y_pred_lr = model.predict(X_test)

plt.figure(figsize=(7, 7))
plt.scatter(y_test, y_pred_xgb, label='XGBoost (tuned)', alpha=1, color='darkred')
plt.scatter(y_test, y_pred_rf, label='Random forest', alpha=0.3, color='orange')
plt.scatter(y_test, y_pred_lr, label='Linear regression', alpha=0.1, color='yellow')
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'k-', lw=1)
plt.xlabel('Actual selling price')
plt.ylabel('Predicted selling price')
plt.grid(False)
plt.legend()
plt.ylim([-5000, y_test.max()])
plt.show()
[Figure: actual vs predicted selling prices for tuned XGBoost, Random Forest, and Linear Regression, with a y = x reference line]

Let's also plot the feature importances of the Random Forest and XGBoost models side by side.

In [183]:
#https://www.geeksforgeeks.org/how-to-generate-feature-importance-plots-from-scikit-learn/
feature_names = df_predictor.columns

feature_imps = pd.DataFrame({
    'feature': feature_names,
    'random forest': rf_model.feature_importances_,
    'xgboost': best_xgb.feature_importances_,
})

x = np.arange(len(feature_names))

width = 0.35
plt.xticks(x, feature_names, rotation=45)
plt.bar(x - width/2, feature_imps['random forest'], width, label='Random forest', alpha=0.7, color='darkgreen')
plt.bar(x + width/2, feature_imps['xgboost'],width, label='XGBoost', alpha=0.4, color='green')
plt.legend()
plt.ylabel('Feature importance')
plt.show()
[Figure: grouped bar chart comparing Random Forest and XGBoost feature importances]

Summary and Conclusion¶

· We were able to beat MMR, but by such a small margin that our model may not be more effective in practice; the gain could reflect overfitting to this data set.

· R^2 can be inflated easily by adding features.

· The strongest model we built was a tuned XGBoost model, while the least effective model was PCA.
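On the R^2 point above: plain R^2 never decreases when predictors are added, which is why adjusted R^2 is a common correction. A small illustrative function; the n and p values below are made-up, not from this project:

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Penalize R^2 for the number of predictors p, given n samples."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical: an R^2 of 0.95 adjusted for 6 predictors on 1000 samples
print(round(adjusted_r2(0.95, n=1000, p=6), 4))  # prints: 0.9497
```

With few predictors and many samples the penalty is tiny, but it grows quickly as p approaches n.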

Limitations¶

· XGBoost training time can be lengthy depending on the machine it is run on.

· We were not able to use trim because it is specific to each vehicle and is a categorical variable with many levels. While there are ways to encode it numerically, we did not have time to do so for every value in this project.

· There were outliers in the data that we removed by keeping only values at or below the 95th percentile. While this helped the model fit, it excluded 5% of the data from our analysis.
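The 95th-percentile filter described above can be expressed in a couple of lines with pandas. A minimal sketch on a toy frame (the column name `sellingprice` is an assumption, not confirmed from the dataset):

```python
import pandas as pd

# Toy data with one extreme outlier
df = pd.DataFrame({'sellingprice': [5000, 8000, 12000, 15000, 900000]})
cutoff = df['sellingprice'].quantile(0.95)         # 95th-percentile price
trimmed = df[df['sellingprice'] <= cutoff]          # keep the bottom 95%
print(len(df), len(trimmed))  # prints: 5 4  (the extreme value is dropped)
```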
